---
layout: page
title: Data Visualisation
permalink: /Rcoding/coding1/
parent: R Coding
nav_order: 1
---
Script 2: Measurement


Main Commands Introduced in this Session

barplot() #creates a barplot
hist() #creates a histogram
boxplot() # creates a boxplot
plot() #creates a basic R plot - of various kinds
abline() # adding lines to a plot
cor() #correlation coefficient
sd() #standard deviation of a variable
var() #variance of a variable
is.na() # Shows whether a value is missing or not
na.rm=TRUE # argument that removes missing values (TRUE/FALSE)

Measurement

In this lab we will be working with data and answering descriptive research questions. We will use a cross-sectional dataset from the Quality of Government project, an online service that collects prominent country-level variables. Our dataset contains information about a variety of national characteristics for 193 countries as measured in 2014. In this lab we will focus on descriptive statistics, which in themselves can reveal a lot of interesting information.


Working with Real Data

Setting a Working Directory & Loading Data using RStudio

When starting a new data analysis using RStudio on our desktop, we should clear our workspace and load in the data. We will read in our data using the read.csv() function. R can handle a wide variety of data sources such as Stata, SPSS or Excel files, as well as files that are stored online. Our dataset here is a text file where the values are separated by commas - a .csv file. This is the procedure to follow:

• Create a folder for script 2, download the qog.csv data using the button above and move it into this folder.

• Open RStudio. Make sure to save and close any open R-scripts. You can clear your environment by clicking the broom icon in the top right RStudio panel. You can also remove all objects using the code: rm(list=ls()). Use this line with extreme caution – it does not ask for confirmation, and if you are working on other projects in the same window it will remove all of those objects too!

• Open a new R-Script and use the code below to read in the dataset. Note that we are assigning the data the name qog, using the <- operator. The variables in this dataset are listed below.

# Set working directory to the folder where 'qog.csv' is

setwd("C:/Users/Desktop/data_analysis_R/script_2")
# You don't need to set the working directory if you are working on RStudio Cloud - you just need to upload the data

qog <- read.csv("qog.csv") # read in the data

Remember, it can be extremely helpful to add a table describing your data (including descriptive statistics - we will learn more about those today).

Variable Name Variable Description
iso3c ISO 3 character country code
country Country name
year Year
region Region
income World Bank Income Group
electoral_fraud Serious electoral fraud in last election? (dummy)
proportional_rep Proportional representation? (dummy)
air_quality Air quality (higher is best)
fragile_state_index Fragile State Index (higher scores more fragile)
educ_years1524_female Years of education for females, aged 15-24
educ_years1524_male Years of education for males, aged 15-24
govt_revenue Government revenue as % of GDP
polity 21-point Polity score (-10 most autocratic; +10 most democratic)
freedomhouse_status Freedom House assessment of freedom
corruption_perceptions_index Corruption Perceptions Index (lower is more corrupt)
human_devt_index Human Development Index (higher is more developed)
access_electricity % of population with access to electricity
gdp_percapita GDP per capita (in dollars)
population Population
gdp GDP total, converted at purchasing power parity

Exploring the Dataset

Whenever we have a new dataset it is a good idea to explore our dataset to see how many variables and observations there are (you can simply use dim() for this purpose), what types of variables there are (class()), what their names are (names()) and what some of their typical values are (summary()). It is also often helpful to visualize the dataset (View()). In the previous script, we explored the leaders assassination dataset using head(), tail(), names(), and str(). Recall that we can access variables in a data.frame using the object[row,column] notation or the $ notation.

dim(qog) # dimensions of the dataset
names(qog) # variable names
str(qog) # gives the structure of the dataset and variable types
class(qog) # the type of object
View(qog) # shows the whole data frame

Measures of Central Tendency

Now that we have gotten a feel for the dataset, let’s delve into some of the more interesting descriptive statistics. Previously, we looked at one summary statistic of central tendency: the mean. This time, we will look at additional measures of central tendency (the mean - again! - the median, and the mode), as well as measures of spread (variance and standard deviation). We will also think about how to visualise these measures.

Central tendency is a way to think about what an “average” value of a variable might be. The best measure of central tendency will depend on the type of variable we are working with (which we can sometimes proxy for using the class() function) as well as its dispersion.

Mode and Bar Plots

For categorical variables, the measure of central tendency is the mode. The mode is the value that appears most often in a set of data values. In other words, it is the value that is most likely to be sampled. We can find the mode of a categorical variable indirectly with the table() function or with a bar plot. Bar plots provide good summaries of a categorical or ordinal variable.

• What is the region of the world with the most countries? The region variable denotes which region of the world a country is located in.

table(qog$region) # Gives a count
region.ptable <- prop.table(table(qog$region))
region.ptable # Gives a proportion
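Base R has no built-in function that returns the mode in this sense (R’s own mode() reports an object’s storage type instead), but the modal category can be pulled straight out of the frequency table. A minimal sketch - note that which.max() returns only the first maximum if several categories are tied:

```r
region.counts <- table(qog$region)   # frequency table, as above
names(which.max(region.counts))      # name of the category with the largest count
```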

You can also visualise this by using a simple command creating a bar plot.

barplot(table(qog$region)) 

Let’s do the same, but this time using the proportions we just calculated:

barplot(region.ptable)

This is insightful indeed. However, the plot is not very clear and a reader might not necessarily understand what it is supposed to convey. This is why annotating figures and plots is absolutely essential - always annotate your plots and figures!

Let’s add a title (using the main argument; axis labels using xlab and ylab) as well as the names of the regions (names.arg) and their script size (cex.names). We can add \n to include a line break within labels. Note that different plot types might require different arguments - but the rest is up to your choice!

barplot(region.ptable,
        main = "Distribution of countries by region",
        xlab= "Region",
        ylab="Proportion of countries",
        names.arg = c("East Asia", "Europe & \nCentral Asia",
"Latin America & \nCaribbean", "Middle East & \nNorth Africa",
"North America", "South Asia", "Sub-Saharan \nAfrica"),
cex.names = 0.3)

This looks much better - and should be much easier to understand.

Let’s look at something more interesting now: What is the most common regime type according to Freedom House?

table(qog$freedomhouse_status)
barplot(prop.table(table(qog$freedomhouse_status)))

Mean, Median and Missing Data

The mean is probably the most common measure of central tendency. Recall the formula for calculating the mean:

\[\bar{x}=\frac{1}{n} \sum_{i=1}^n x_i\]

The median is also commonly used, particularly when a variable has a skewed distribution (we’ll look into this soon). The median is the value at a variable’s 50th percentile — half of the observations lie below it and half above. Both are appropriate for continuous variables — which R may refer to as numeric or integer variables.
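To connect mean() to the formula above, here is the same computation written out by hand on a small made-up vector:

```r
x <- c(2, 4, 6, 8)           # toy data
n <- length(x)
manual.mean <- sum(x) / n    # (1/n) * the sum over x_i
manual.mean                  # 5
mean(x)                      # same result: 5
```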

• Calculate the mean of the United Nations Development Programme’s Human Development Indicator (human_devt_index) using the following code:

class(qog$human_devt_index) # check the class of the variable
mean(qog$human_devt_index)

Why do we get NA?

By default, R assumes there are no missing values in the data. However, in many (most!) datasets there will be (quite a few!) missing values. Many commands in R will not work when data are missing unless a specific argument is added to the code to tell R how to process the missing values.

To see if there are missing values for a variable we can call is.na(), which evaluates whether a value is missing or not. We can also call sum(is.na()) to count how many values are missing in a given dataset, or look it up with summary():

is.na(qog$human_devt_index)
sum(is.na(qog$human_devt_index))
summary(qog$human_devt_index)

To get R to return the mean ignoring missing values we add na.rm=TRUE as an additional argument to the mean function. This parameter tells R to omit observations with missing data for a particular variable when calculating the mean or median. Note that different commands may have different NA arguments, especially user-written ones.

mean(qog$human_devt_index, na.rm=TRUE)
median(qog$human_devt_index, na.rm=TRUE)

Measures of Spread

The median and mean summarise data into a single value that is typical or most representative of all the values of a variable. But these “central tendencies” are only a part of the picture that summarizes a variable. Measures of spread (or measures of dispersion) summarize data in a way that shows how dispersed or concentrated values are. The range, quantiles, variance, and standard deviation all are measures of spread.

• The range is the difference between the smallest value (min()) and the largest value (max()) in a dataset.

• Quantiles (quantile()) sort a variable from minimum to maximum and identify the value of a variable at a chosen point (0.00–1.00).

• The variance (var()) and the standard deviation (sd()) are measures of the spread of the data around the mean. They summarize how tightly or loosely observations are clustered around the mean.

Range and Quantiles

The most basic description of a continuous variable’s spread is its range - the difference between the largest and smallest values:

max(qog$human_devt_index, na.rm = TRUE) - min(qog$human_devt_index, na.rm = TRUE)

But the range alone cannot tell us how observations are distributed between the minimum and maximum values - extreme values might be very uncommon and distort our view of the data. To get a better sense of how the data is distributed we can use quantiles to divide our data into groups. Commonly values are divided into quartiles — four even subsets of a variable. The first quartile (or lower quartile) is the value below which 25% of the observations fall, while 75% of the observations fall below the third (or upper) quartile. The second quartile equals the median. The difference between the upper and lower quartiles is called the interquartile range, or IQR. An advantage of the IQR over the standard deviation is that it is not sensitive to outliers.

• Compute the 25th quantile and the IQR of the UNDP HDI variable - there are various functions one could use.

quantile(qog$human_devt_index, 0.25, na.rm = TRUE) # Specifies the lower quantile
# The following displays min, max, median, and lower (25%) and upper (75%) quantiles
quantile(qog$human_devt_index, na.rm = TRUE)
IQR(qog$human_devt_index, na.rm=TRUE)

Standard Deviation and Variance

The standard deviation (\(σ\)) measures the spread of the distribution by quantifying how far away data points are, on average, from their mean. Each observation has a deviation, i.e. a distance, from the mean. A deviation is positive when an observation falls above the mean, and negative when it falls below the mean. Recall how we calculated the mean - by construction, the sum of all deviations around the mean equals 0. Hence, to get a measure of overall variability, we need to use either absolute values or the sum of squared deviations from the mean.

The average sum of squared deviations from the mean is called the variance, and its square root is the standard deviation.

\[\sigma=\sqrt{\frac{1}{n-1} \sum_{i=1}^n\left(x_i-\bar{x}\right)^2}\]

The standard deviation is easier to interpret as it is expressed in the same unit as the mean. The larger the standard deviation, the greater the spread of the data. The value of \(σ\) is zero when all observations have the exact same value – i.e., when there is no spread. For variables on the same scale, a distribution with a larger standard deviation has a greater variability than one with lower standard deviation. But how do we interpret the size of a standard deviation? We can do so by comparing it to the mean.
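To connect sd() and var() to the formula above, here is the same computation written out by hand on a small made-up vector:

```r
x <- c(2, 4, 6, 8)                    # toy data
n <- length(x)
devs <- x - mean(x)                   # deviations from the mean
sum(devs)                             # 0 by construction
manual.var <- sum(devs^2) / (n - 1)   # variance
manual.sd <- sqrt(manual.var)         # standard deviation
c(manual.var, var(x))                 # both give the same value
c(manual.sd, sd(x))                   # both give the same value
```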

• Calculate the standard deviation and variance of the UNDP HDI variable.

• What is the expected value of the UNDP HDI variable at one standard deviation above the mean?

sd(qog$human_devt_index, na.rm=TRUE)
var(qog$human_devt_index, na.rm=TRUE)
mean(qog$human_devt_index, na.rm=TRUE) + sd(qog$human_devt_index, na.rm=TRUE)

Histograms

Histograms are the most common way to summarize the distribution of a continuous variable. A histogram breaks the distribution of the data into bins and counts the number of observations in each bin.

• Create a histogram of the UNDP HDI variable.

hist(qog$human_devt_index)
hist(qog$human_devt_index, freq=FALSE) #provides density instead of frequency count

If the freq argument is set to FALSE, the height of each bin is given by its “density”, which is computed as the proportion of observations in the bin divided by the width of the bin.
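We can check this relationship ourselves: hist() invisibly returns a list of the computed breaks, counts, and densities, and with plot = FALSE it computes them without drawing anything. A minimal sketch on a small made-up vector (the same holds for qog$human_devt_index):

```r
x <- c(1, 2, 2, 3, 3, 3)                   # toy data
h <- hist(x, breaks = c(0, 1, 2, 3), plot = FALSE)
h$counts                                   # observations per bin: 1 2 3
widths <- diff(h$breaks)                   # bin widths: 1 1 1
h$density                                  # equals (counts / n) / width
(h$counts / sum(h$counts)) / widths        # same values
```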

We can tell R how we would like the bins set in either of two ways: by suggesting the number of bins, or by specifying exactly where the bins should start and end and how wide they should be.

• Now create a histogram and set bin widths in an intelligible way.

hist(qog$human_devt_index,
breaks=10, xlim= c(0,1))

Here, we ask for roughly 10 bins — note that R treats this number only as a suggestion and may adjust it to produce round break points. We also indicate the range of the x-axis, i.e. the values we want to see represented. What do you see?

There are no observations at very low values - this means the 10 bins start at the minimum and don’t line up intuitively.

Let’s check the minimum and maximum values of the variable

summary(qog$human_devt_index)
# or
min(qog$human_devt_index, na.rm=TRUE) 
max(qog$human_devt_index, na.rm=TRUE)

Let’s now create a more appropriate histogram. What do you see now?

hist(qog$human_devt_index,
breaks=seq(from=.3, to =1, by=.05), xlim=c(0.3,1),
main="Distribution of HDI",
xlab="UNDP Human Development Indicator")

• Identify the mean and median of the UNDP HDI variable on the histogram.

• Identify observations one standard deviation above and below the mean.

Note: abline() allows you to add lines to plots; the v argument indicates where vertical lines should be drawn. You can then feed in means, medians and other statistics.

hist(qog$human_devt_index,
     breaks=seq(from=.3, to =1, by=.05), xlim=c(0.3,1),
     main="Distribution of HDI",
     xlab="UNDP Human Development Indicator")
abline(v=median(qog$human_devt_index, na.rm=TRUE), col="red") # add vertical line at the median
abline(v=mean(qog$human_devt_index, na.rm=TRUE), col="blue") # add vertical line at the mean
abline(v=mean(qog$human_devt_index, na.rm=TRUE) + sd(qog$human_devt_index, na.rm=TRUE),
col="blue", lty="dashed") # add vertical line at +1sd
abline(v=mean(qog$human_devt_index, na.rm=TRUE) - sd(qog$human_devt_index, na.rm=TRUE),
col="blue", lty="dashed") # add vertical line at -1sd

Finally, because bins are discrete, but our data is actually continuous, the bins can be a confusing way to represent the distribution of values. Another way to visualize the distribution of a variable is with the kernel density, which smooths the distribution (think of it like having an infinite number of infinitely small bins).

plot(density(qog$human_devt_index, na.rm = TRUE))

As you can see, the only difference between the histogram with frequency and the one with density is the scale of the y-axis. Remember that you can show the density instead of the frequency on the y-axis by specifying the option freq=FALSE.

Box Plots

Box plots also let us summarize the distribution of a continuous variable. They can be very effective when there are subgroups that we would like to distinguish between. A boxplot is composed of several elements:

• The line that divides the box into 2 parts represents the median of the data.

• The box extends from the lower to the upper quartiles. Hence, it represents the IQR, so that 50% of the data are located within the box.

• The dashed lines - the ‘whiskers’ - extend from the box to the most extreme data points that lie within 1.5 times the IQR of the lower and upper quartiles. Their ends represent the highest and lowest values of the distribution (excluding outliers).

• Dots beyond the whiskers show outliers, i.e. observations that lie more than 1.5 times the IQR beyond the lower or upper quartile.
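The anatomy described above can be inspected numerically with boxplot.stats(), the base R function that computes the five values the box and whiskers are drawn from. A sketch on made-up data with one obvious outlier:

```r
x <- c(1, 2, 3, 4, 5, 100)   # toy data; 100 is a clear outlier
bs <- boxplot.stats(x)
bs$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$out     # observations flagged as outliers: 100
```

Strictly speaking, boxplot.stats() uses “hinges”, which are very close to (and often identical with) the lower and upper quartiles.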

The following command generates a boxplot of the HDI variable. We can also specify ~ to generate multiple boxplots of the HDI separated by the values of other categorical variables:

boxplot(qog$human_devt_index)
# distribution of HDI by Freedom House status
boxplot(qog$human_devt_index ~ qog$freedomhouse_status)
boxplot(qog$human_devt_index ~ qog$region,
cex = 0.45) # distribution of HDI by region

We can also think about how the mean (which is not shown in a box plot) relates to this distribution using the tapply() function. The tapply(x, index, fun) function takes an outcome variable (x) and applies the function fun within each level of the index. In the following example, tapply() gives the mean value of the HDI for each level of the Freedom House status in the first line, and for each region in the second line.

tapply(qog$human_devt_index, qog$freedomhouse_status, mean, na.rm=TRUE)
tapply(qog$human_devt_index, qog$region, mean, na.rm=TRUE)

To summarize, here is a list of levels of measurement of a variable and the corresponding descriptive statistics and plots we have used so far. Note that this list is not exhaustive - these descriptive statistics might well be visualised in other ways.

Level of Measurement Descriptive Statistics Plot Type
Nominal Frequencies Bar Plot
(no order) Proportions
Mode
Ordinal Frequencies Bar Plot
(ordered categories) Median, Mode, (Mean) Box Plot
Continuous Median, Mode, Mean Histogram
(uniform distance) Quantile, SD Density Plot
Range, IQR Box Plot



Summarising Bivariate Relationships

Now that we have explored basic descriptive statistics, we can delve a bit deeper. As a final step we will look at associations and correlations between two variables.

Scatter Plots

What values does a variable take at different values of another variable? We can plot the relationship between two variables with the plot() command to create a scatter plot. Scatter plots are a very powerful way to get a sense of how two variables are related. It is almost always a good idea to have a look at a scatter plot when you first investigate a new variable of interest.

We might think, for instance, that there is an association between HDI and GDP per capita (gdp_percapita). Something would actually be wrong if there weren’t, since GDP per capita is one of the factors that compose the HDI.

• Create a scatter plot comparing the HDI and GDP per capita.

plot(qog$human_devt_index, qog$gdp_percapita) # plot(x-values, y-values)

What are some other likely associations – or non-associations – in our dataset?

plot(qog$human_devt_index, qog$polity)
plot(qog$human_devt_index, qog$access_electricity)
plot(qog$human_devt_index, qog$corruption_perceptions_index)
plot(qog$human_devt_index, qog$fragile_state_index)

Correlation

The associations we see in scatter plots may be strong, weak, or mixed. Another way to gauge the extent of an association is to consider the correlation between two variables. The correlation coefficient tells us how, on average, two variables move together relative to their means. The formula for the correlation coefficient is the following:

\[\rho_{x y}=\frac{1}{n-1} \sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma_x} \times \frac{y_i-\bar{y}}{\sigma_y}\right)\]

Pearson’s correlation coefficient ranges from -1 (perfect negative correlation) through 0 (no correlation whatsoever) and finally to +1 (perfect positive correlation).
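To connect cor() to the formula above, here is Pearson’s r computed by hand on two small made-up vectors, standardising each variable first:

```r
x <- c(1, 2, 3, 4, 5)                # toy data
y <- c(2, 4, 5, 4, 6)
n <- length(x)
zx <- (x - mean(x)) / sd(x)          # x in standard-deviation units
zy <- (y - mean(y)) / sd(y)          # y in standard-deviation units
manual.r <- sum(zx * zy) / (n - 1)   # average product of the standardised values
c(manual.r, cor(x, y))               # both give the same value
```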

If there is missing data, we must tell R not to use those observations in its calculation. We do so with the additional use = "pairwise.complete.obs" argument - this instructs R only to use observations where there is no missing value for either of the two variables.

cor(qog$human_devt_index, qog$polity, use="pairwise.complete.obs")
cor(qog$human_devt_index, qog$access_electricity, use="pairwise.complete.obs")
cor(qog$human_devt_index, qog$corruption_perceptions_index, use="pairwise.complete.obs")
cor(qog$human_devt_index, qog$fragile_state_index, use="pairwise.complete.obs")

Finally, assess the relationship between democracy measured by the Polity score (polity) and state fragility as measured by the Fragile States Index (fragile_state_index).

• Is democracy correlated with state fragility?

• Produce a scatterplot that shows the association between democracy (x-axis) and state fragility (y-axis).

• How do the two variables tend to move together? Add lines to the plot at the mean of each variable.

• Identify Greece and Italy on the plot.

To answer these questions, let’s start with correlation:

cor(qog$polity, qog$fragile_state_index, use="pairwise.complete.obs")

Let’s do the scatter plot next, adding lines for the variables’ means - feel free to add colours to highlight the means:

plot(qog$polity, qog$fragile_state_index,
     main="Relationship between Polity score and State Fragility Index",
     xlab="Polity score", ylab="State Fragility Index")
abline(v=mean(qog$polity, na.rm=TRUE))
abline(h=mean(qog$fragile_state_index, na.rm=TRUE))

You can then identify specific countries - recall how we indexed values in our last session:

# Italy
print(qog$polity[qog$country=="Italy"])
print(qog$fragile_state_index[qog$country=="Italy"])
# Greece
print(qog$polity[qog$country=="Greece"])
print(qog$fragile_state_index[qog$country=="Greece"])

Using the points() command, you can add these countries to your scatter plot and adjust colours and point sizes:

# Add them to the plot
points(qog$polity[qog$country=="Italy"],
qog$fragile_state_index[qog$country=="Italy"],
col="red", pch=17)
points(qog$polity[qog$country=="Greece"],
qog$fragile_state_index[qog$country=="Greece"],
col="red", pch=17)

Now identify a country of your choice!

Let’s conclude by identifying more specific cases - this might come in handy if you see an interesting data point you would like to learn more about.

• Use indexing and subsetting to identify the most fragile state with polity==9.

• Use indexing and subsetting to identify the outlier “non-fragile” state toward the centre of the Polity score.

#Use indexing and subsetting to identify the most fragile state with `polity==9`
max(qog$fragile_state_index[qog$polity==9], na.rm=TRUE)
qog$country[qog$fragile_state_index==97.3]
qog$country[qog$fragile_state_index == 97.3 & !is.na(qog$fragile_state_index)]
qog$country[qog$fragile_state_index == 97.3 & !is.na(qog$fragile_state_index) & qog$polity == 9]

We first identify the maximum value among all states with a Polity score of nine. However, several observations might have the same value, at least in principle, so we can’t be sure we have found the state we want. Thus, we further restrict the conditions to a Polity score of nine, a fragile state index of 97.3, and a non-missing fragile state index.

We can do the same for our outlier and actually write all this in one line of code:

#identify the "non-fragile" state toward the centre of the Polity score.
qog$country[qog$fragile_state_index==min(qog$fragile_state_index[qog$polity==-2], na.rm=TRUE)]

Let’s finally recreate the scatter plot with country abbreviations as markers - we do this by producing an empty plot (type="n") before adding the text() command with three arguments: the x coordinates, the y coordinates, and the variable containing the labels we want to plot:

plot(qog$polity, qog$fragile_state_index, type="n",
main="Relationship between Polity score and State Fragility Index",
xlab="Polity score", ylab="State Fragility Index",
cex = 0.5)
text(qog$polity, qog$fragile_state_index, qog$iso3c)

What does this plot look like?


Exercise: Hit or Miss? Success of Leader Assassination as a Natural Experiment

For this exercise, you will re-visit the leaders’ assassination dataset from before. This time, we’ll be looking at more substantive descriptive statistics that will help us understand the consequences of political assassinations. Recall from last time that Jones and Olken (2009) treated the success of an assassination attempt as randomly determined: sometimes they succeed and sometimes they don’t, and the variation in success is itself not a function of pre-existing regime characteristics or post-assassination attempt potential outcomes/counterfactuals. If this is true, then the leader assassinations might constitute natural experiments that researchers can use to answer causal questions.

Each observation of the leaders.csv data set contains information about an assassination attempt. The variables are:

Variable Name Variable Description
country The name of the country
year Year of assassination
leadername Name of leader who was targeted
result Result of the assassination attempt (one of 10 categories)
age Age of the targeted leader
politybefore Average Polity score during the 3 year period prior to the attempt
polityafter Average Polity score during the 3 year period after the attempt
civilwarbefore 1 if country is in civil war during the 3 year period prior to the attempt, or 0 otherwise
civilwarafter 1 if country is in civil war during the 3 year period after the attempt, or 0 otherwise
interwarbefore 1 if country is in interstate war during the 3 year period prior to the attempt, or 0 otherwise
interwarafter 1 if country is in interstate war during the 3 year period after the attempt, or 0 otherwise



Today, you will visualise the substantive outputs from last week - that is, you’ll be creating graphs.

#setwd("...") # set your working directory
leaders <- read.csv("leaders.csv")

Task 1

Are there more assassination attempts in authoritarian countries or in democratic countries? Use a histogram to compare the incidence of assassination attempts at different values of the polity score. What additional information would we like to have to know whether assassination attempts are more common in authoritarian or democratic countries?

Task 2

Last time, we created a new binary variable indicating whether an assassination attempt succeeded in killing a leader. Using this binary success variable, determine whether the success rate of assassination attempts differs by an important pre-treatment regime characteristic: the Polity score 3 years before the assassination attempt. Last week, you looked at the mean score of this variable for successful and unsuccessful attempts - to look into this in more detail, you’ll be using distribution plots now. Provide a histogram of the pre-treatment Polity score for successful and unsuccessful assassination attempts, respectively. How does the distribution compare to the mean pre-treatment Polity score? Add the mean pre-treatment Polity score to the histograms. Does your graph have an informative title and labels?

Task 3

Histograms are not the only way to compare the distribution of a variable. Use a box plot to summarize the pre-treatment Polity score for successful and unsuccessful assassination attempts.

Which plot do you find the most compelling? Is the distribution of successful assassination attempts random? Is it balanced with respect to pre-treatment Polity scores?

Task 4

• Now compare the pre-assassination attempt Polity scores (politybefore) with their post-assassination attempt Polity scores (polityafter) using a scatter plot.

• Use different coloured markers to identify observations where assassination attempts were successful and unsuccessful.

• Draw a diagonal line through your plot using (abline(0,1)) and assess whether assassination attempts lead to changes in Polity scores.

Feel free to try something new: This is not required, but be aware that you can make your graphs easier to read if you slightly adjust the position of the points on the plot using jitter() i.e.

plot(jitter(var1,1.25),
jitter(var2, 1.25))

Task 5

Last time, you conducted comparisons to determine the effect of leader assassination on democratization outcomes. The effect was small and positive: successful assassinations may move countries towards democracy. Now you will consider the effect more closely.

• First, calculate the change in Polity score for each observation.

• Next, use two histograms to compare the change in Polity score in countries with successful and unsuccessful assassination attempts.

• Provide a brief interpretation of the histograms and the scatterplot in the previous question by connecting your analysis to the results obtained in the difference-in-differences analysis from the previous script.